Presentation

Several models were estimated, including linear, (linear) robust and non-linear (Poisson and Negative Binomial). It was found that the linear approach employing a CooksD-corrected dataset did the best job minimizing the RMSE. The results will be displayed in .pdf and .html. The .pdf file has fixed margins, hence not all outputs will show up (particularly, the super wide stepwise regression output). The .rmd file contains all reproducible code, and I took really good care the file is absolutely self-contained. There is also a GitHub repo (https://github.com/hbahamonde/trivago) where you can follow all my commits (and see my other work too, if you want). For convenience, I have also published the html version of this assignment (code and output) in my website (www.HectorBahamonde.com/Trivago). Since Zealpath does not allow me to send a fourth file (the .html file), you will need to check that out visiting my website, if so desired. That’s important though, as wide output won’t show up properly in the .pdf (it will in the .rmd file, though). Feel free to ask me any questions.

Introduction

Lets begin fresh:

rm(list=ls()) # cleans environment
graphics.off() # kills all plots
cat("\014") # cleans the console

Now, lets load the pacman package (this package conditionally loads uninstalled packages). The package avoids installing already installed packages. The way to load the package and library is as follows p_load(PACKAGENAME). Lets install pacman.

if (!require("pacman")) install.packages("pacman"); library(pacman)

Lets load the dataset:

p_load(foreign)
dat <- read.csv("data.csv",
                header = T, # lets name the columns
                sep = ";") # lets separate every entry by ";"

# this is to send you guys the row_nums for all cases where hits is NA. 
# More on this, at the very end.
dat.deliverable = dat

Lets summarize/inspect the data:

head(dat)
##   row_num locale day_of_week hour_of_day agent_id entry_page path_id_set
## 1  988681     L6      Monday          17        1       2111     31672;0
## 2  988680     L2    Thursday          22       10       2113     31965;0
## 3  988679     L4    Saturday          21        2       2100     0;78464
## 4  988678     L3    Saturday          19        8       2113       51462
## 5  988677     L2     Tuesday           6       10       2116     31931;0
## 6  988676     L3      Monday           1        8       2100           0
##   traffic_type session_durantion hits
## 1            6              7037  \\N
## 2            2                49   14
## 3            1              1892   14
## 4            6                 0    1
## 5            1                 2    3
## 6            1                 0    2
summary(dat)
##     row_num       locale         day_of_week      hour_of_day  
##  Min.   :     1   L1: 41568   Friday   :127740   Min.   : 0.0  
##  1st Qu.:247171   L2:172904   Monday   :158685   1st Qu.: 8.0  
##  Median :494341   L3:352454   Saturday :119350   Median :14.0  
##  Mean   :494341   L4:131271   Sunday   :144330   Mean   :13.2  
##  3rd Qu.:741511   L5:112054   Thursday :136476   3rd Qu.:19.0  
##  Max.   :988681   L6:178430   Tuesday  :155001   Max.   :23.0  
##                               Wednesday:147099                 
##     agent_id        entry_page    path_id_set      traffic_type   
##  Min.   : 0.000   Min.   :2100   0      : 54444   Min.   : 1.000  
##  1st Qu.: 6.000   1st Qu.:2111   38715;0: 11002   1st Qu.: 1.000  
##  Median : 9.000   Median :2113   34741;0:  8958   Median : 2.000  
##  Mean   : 7.351   Mean   :2253   34812;0:  7408   Mean   : 2.774  
##  3rd Qu.:10.000   3rd Qu.:2116   78506;0:  6242   3rd Qu.: 4.000  
##  Max.   :15.000   Max.   :8101   34227;0:  5614   Max.   :10.000  
##                                  (Other):895013                   
##  session_durantion     hits        
##  0      :115579    \\N    :369446  
##  2      : 43906    3      : 98274  
##  3      : 40324    4      : 65158  
##  4      : 33648    2      : 50675  
##  5      : 26183    5      : 37135  
##  1      : 21331    1      : 29381  
##  (Other):707710    (Other):338612

We could use View(dat) if we wanted to take a closer look. The dataset is very large, hence, it wouldn’t be that convenient.

A few fixes before we start

  1. Fixing a typo
colnames(dat)[9] # typo (i.e. session_durantion).
## [1] "session_durantion"
colnames(dat)[9] <- "session_duration" # now it's fixed.
  1. NAs in the DV. Currently, NAs are denoted by “\N”. Will change it to “NA.” Note: The “\” symbol is an escape character. That’s why I escaped it.
dat$hits[dat$hits == "\\N"] <- NA

Now, lets drop all rows that contain NAs in the dataset. We don’t want to have NAs in the dependent variable either.

dat = na.omit(dat) # deletes NAs.
  1. Variable “hits” is factor. Since we care for the number (i.e. counts) of hits, will change it to numeric.
is(dat$hits)[1] # lets ask R what "hits" is.
## [1] "factor"
dat$hits = as.numeric(dat$hits) # for the rest of the assignment, will treat it as numeric
  1. Seconds are factor. Will change it to numeric.
is(dat$session_duration)[1] # lets ask R what "session_duration" is.
## [1] "factor"
dat$session_duration = as.numeric(dat$session_duration) # better as numeric.
  1. The device used for the session (i.e. “agent_id”) is numeric. It should be factor.
is(dat$agent_id)[1] # lets ask R what "agent_id" is.
## [1] "integer"
dat$agent_id = as.factor(dat$agent_id) # better as factor.
  1. The factor variable “day of the week” is not ordered. Statistically, it won’t matter, but we might need it ordered if we wanted to plot the predicted weekly hits. Lets fix it.
levels(dat$day_of_week) # see, levels are not ordered. 
## [1] "Friday"    "Monday"    "Saturday"  "Sunday"    "Thursday"  "Tuesday"  
## [7] "Wednesday"
dat$day_of_week = factor(dat$day_of_week,levels(dat$day_of_week)[c(2,6,7,5,1,3,4)])
  1. “Entry page” is numeric. It makes much more sense to think about every landing page as a factor variable instead.
is(dat$entry_page)[1] # lets check what this variable type is.
## [1] "integer"
dat$entry_page = as.factor(dat$entry_page) # lets change that to factor.

Now, it seems that a large part of this distribution, is concentrated in landing pages 2100, 2111, 2113, 2114, 2115 and 2116.

table(dat$entry_page)[1:6]
## 
##   2100   2111   2113   2114   2115   2116 
## 110762  51932 205968  44589  10655 125338
# loading lattice, for a "nicer" histogram
p_load(lattice)
histogram(dat$entry_page, xlab = "Entry Page", breaks = length(levels(dat$entry_page)))

To make a more parsimonius model with less intercepts, I will recode this variable as follows. Will keep “landing pages” 1 to 6, but will create a 7th landing page coded as “0.” This residual category will group all other sessions, which many times, had just a couple of sessions only. Lets recode the variable.

levels(dat$entry_page)[levels(dat$entry_page)>"2116"] <- "0"

Now, lets plot the distribution of levels of the new variable:

p_load(lattice)
histogram(dat$entry_page, xlab = "Entry Page")

Ok. This looks way better. Instead of having almost 150 landing pages, we have simplified this variable to the most important ones.

  1. Variable “path_id_set”, as a variable, is kind of noisy as it is.
head(dat$path_id_set)
## [1] 31965;0 0;78464 51462   31931;0 0       0      
## 151268 Levels:  0 ... 99998;0

It makes more sense to think of this variable as a count of “locations that were visited during the session” rather than the locations themselves. This is because is hard to model as a function of really specific and almost unique locations.

First, lets count the number of strings in this (as is) character variable. And then, lets sum up all locations.

p_load(stringr)
dat$path_id_set = str_count(as.character(dat$path_id_set), '\\d+')

Now the variable is a count variable.

Note: I assumed that the number 0, or “location 0” (very frequent, btw) was an actual location among other “locations that were visited during the session”.

That said, now “path_id_set” is almost only 2’s. That is, almost everyone visited two locations.

p_load(lattice)
histogram(dat$path_id_set, breaks = 100)

To make a simpler model, I will dichotomize “path_id_set”: now observations are either 2 (1) or not (0). Either “i” visited two locations, or not.

dat$path_id_set = as.numeric(dat$path_id_set == 2)

p_load(lattice)
histogram(as.factor(dat$path_id_set), xlab = "Path Id Set")

  1. Variable “traffic_type” is numeric. It makes much more sense to think of this variable as a factor (email, search engine, etc.).
is(dat$traffic_type)[1] # is numeric.
## [1] "integer"
dat$traffic_type = as.factor(dat$traffic_type) # Better as factor.
  1. Will change the “day_of_week” variable to numeric (1 = Monday…till 7 = Sunday). Theoretically, I think it makes more sense to think about this variable as numeric. Lets think of something like “the degree in which weekend is approaching.” Perhaps, consumers make vacations plans (and visit www.Trivago.com) during weekends. Will see…
dat$day_of_week.num = as.numeric(as.factor(dat$day_of_week))

# testing
data.frame(
  old = dat$day_of_week,
  new = dat$day_of_week.num
)[100:110,] # just to take a look, from row 100 to row 110. OK, it looks good.
##           old new
## 100    Monday   1
## 101    Monday   1
## 102  Thursday   4
## 103  Saturday   6
## 104 Wednesday   3
## 105 Wednesday   3
## 106 Wednesday   3
## 107  Thursday   4
## 108  Saturday   6
## 109    Friday   5
## 110    Monday   1

Descriptives

p_load(lattice)
histogram(dat$hits) # plot the DV

First impressions: of course, it doesn’t look like a normal distribution. It can’t be, anyways. The data generating process (the “number of hits”) might follow a Poisson distribution. That is, counts in the form of discrete positive numbers. But lets take a closer look at dependent variable. Normal distributions have the same mean, mean, and mode. Lets see:

mean(dat$hits) # Mean
## [1] 313.0943
median(dat$hits) # Median
## [1] 286
getmode <- function(v) { # Little function to get the mode.
        uniqv <- unique(v)
        uniqv[which.max(tabulate(match(v, uniqv)))]
        }
getmode(dat$hits) # Mode
## [1] 286

OK. While visual inspection suggests that the DV is not normally distributed, the summary statistics shown above suggest that hits is pretty much normally distributed. In any case, when I get to the modeling stage, I will still consider a Poisson and a Negative Binomial models (and their corresponding diagnostics) for completeness.

Few more modifications before we start modeling: Variable names mess with both TeX and HTML KNITR compilation. Will replace "_" for “.”

colnames(dat)[which(names(dat) == "row_num")] <- "row.num"
colnames(dat)[which(names(dat) == "day_of_week")] <- "day.of.week"
colnames(dat)[which(names(dat) == "agent_id")] <- "agent.id"
colnames(dat)[which(names(dat) == "entry_page")] <- "entry.page"
colnames(dat)[which(names(dat) == "path_id_set")] <- "path.id.set"
colnames(dat)[which(names(dat) == "traffic_type")] <- "traffic.type"
colnames(dat)[which(names(dat) == "session_duration")] <- "session.duration"
colnames(dat)[which(names(dat) == "day_of_week.num")] <- "day.of.week.num"
colnames(dat)[which(names(dat) == "hour_of_day")] <- "hour.of.day"

save(dat, file = "dat.RData")

Models

In this section, I will fit the data using several modeling strategies. My general approach will be the following:

  1. Define a full linear model with all possible variables.
  2. Using this full linear model, find the best possible predictors in a stepwise fashion.
  3. With the selected variables, estimate another linear model.
  4. Run standard diagnostics, inspect fit, check assumptions (heteroskedasticity, etc.).
  5. Using the best predictors found in the stepwise process, consider estimating a robust linear model.
  6. Next, move to GLMs. Investigate the plausibility of a Poisson and/or a Negative Binomial models.
  7. Conclude, and recommend the most efficient model.

Lets define the full equation.

formula.all = as.formula(
        hits ~ 
                locale + 
                hour.of.day + 
                agent.id + 
                entry.page + 
                path.id.set + 
                traffic.type + 
                session.duration + 
                day.of.week.num
        )

Lets fit the data via a linear model.

all.vars.model.ols = lm(formula.all, dat)

Now, before discussing the output of the unrestricted model, lets first find the most efficient subset of variables. Criteria will be minimizing the RMSE. Next, I use the ols_step_best_subset function. Usually stepwise regression selects variables with the lowest p-values. Hence, I decided to take a slighlty different approach by focusing on the RMSE instead.

Note: (a) this might take a while, depending on your machine. Also, (b) the outout is very wide, and might not fit the PDF. That’s why I’m sending also a HTML file, which you guys may see it from almost every web browser. The HTML will be available in my website www.HectorBahamonde.com/Trivago too. I designed my website using GitHub too, to ensure full compatibility with r, LaTeX, Markdown, etc. There is also a GitHub repo (https://github.com/hbahamonde/trivago) where you can follow my work. I strongly suggest you to take a look at the HTML file (either the one I sent, or the one that’s published—both are the same). For instance, regarding the following output, since it is so wide, the column we care about (MSEP: Estimated error of prediction) will not show up in the PDF.

p_load(olsrr) # for "ols_step_best_subset" function.
ols_step_best_subset(all.vars.model.ols)
##                                             Best Subsets Regression                                            
## ---------------------------------------------------------------------------------------------------------------
## Model Index    Predictors
## ---------------------------------------------------------------------------------------------------------------
##      1         session.duration                                                                                 
##      2         entry.page session.duration                                                                      
##      3         entry.page path.id.set session.duration                                                          
##      4         entry.page path.id.set traffic.type session.duration                                             
##      5         agent.id entry.page path.id.set traffic.type session.duration                                    
##      6         locale agent.id entry.page path.id.set traffic.type session.duration                             
##      7         locale hour.of.day agent.id entry.page path.id.set traffic.type session.duration                 
##      8         locale hour.of.day agent.id entry.page path.id.set traffic.type session.duration day.of.week.num 
## ---------------------------------------------------------------------------------------------------------------
## 
##                                                               Subsets Regression Summary                                                               
## -------------------------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                                 
## Model    R-Square    R-Square    R-Square       C(p)           AIC             SBIC            SBC            MSEP          FPE         HSP       APC  
## -------------------------------------------------------------------------------------------------------------------------------------------------------
##   1        0.0294      0.0294      0.0294    20737.3409    8395513.2046    6638200.2677    8395547.2134    45251.8443    45251.8443    0.0731    0.9706 
##   2        0.0493      0.0493      0.0493     7599.9994    8382679.2163    6625356.3381    8382781.2425    44323.2691    44323.2691    0.0716    0.9507 
##   3        0.0571      0.0571       0.057     2507.2953    8377627.6694    6620304.8320    8377741.0318    43963.1630    43963.1630    0.0710    0.9429 
##   4        0.0586      0.0586      0.0586     1477.5817    8376611.2496    6619278.4207    8376792.6294    43890.7061    43890.7061    0.0709    0.9414 
##   5        0.0601      0.0601        0.06      505.5815    8375666.7891    6618307.9743    8376006.8763    43822.8948    43822.8948    0.0708    0.9399 
##   6        0.0609      0.0609      0.0608      -12.6314    8375156.7723    6617789.9677    8375553.5407    43786.5333    43786.5333    0.0707    0.9391 
##   7        0.0609      0.0609      0.0608      -18.0892    8375151.3141    6617784.5098    8375559.4188    43786.1473    43786.1473    0.0707    0.9391 
##   8        0.0609      0.0609      0.0608      -18.0000    8375151.4032    6617784.5990    8375570.8441    43786.1536    43786.1536    0.0707    0.9391 
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria

The MSEP carries almost the same information as the RMSE. You can try calculating the square root of the MSEP to obtain the RMSE. Hence I’ll use the MSEP to select the most efficient variables. Models 7 and 8 have statistically speaking, identical predicted errors. Since I’m just scratching the surface now, I will select model 8 which has all possible variables, and begin modeling/cleaning from there.

But lets now calculate the actual RMSE of the full model.

sqrt(rev(anova(all.vars.model.ols)$"Mean Sq")[1]) # RMSE
## [1] 209.2499

OK. Now lets see how we are doing good-of-fit wise.

p_load(lattice)
xyplot(predict.lm(all.vars.model.ols, type="response") ~ dat$hits,
       xlab = "Observed Hits", 
       ylab = "Predicted Hits",
       main = "Model 1: Observed v. Fitted",
       panel = function(x, ...) {
               panel.xyplot(x, ..., alpha = 0.2)
               panel.lmline(x, y, ...)
               }
       )

Not so good. Lots of influence (upper-left corner of plot) and leverage problems here—pretty much, just across all the x-axis. My take, not a good fit (i.e. not a good enough RMSE).

Lets now plot the error to check normality and homoscedasticity, that is, constant variance on the predicted residuals. Given the poor fit, we can’t anticipate promising results at this stage.

p_load(lattice)
xyplot(all.vars.model.ols$residuals ~ dat$hits,
       xlab = "Observed Hits", 
       ylab = "Residuals",
       main = "Model 1: Residual Plot", 
       panel = function(x, ...) {
               panel.xyplot(x, ..., alpha = 0.2)
               }
       )

Huge problems here. We should be able to see no patterns, which we do: the residuals go up as the X variable goes up too. That’s not a good sign.

One problem of this dataset is that it might have extreme observations (outliers). In this section, I will try to detect those, and evaluate what can we do with those.

Lets get Cook’s distance (a typical measurement of influencial data points), and merge it into our main dataset.

options(scipen=100000000) # increases threshold for scientific notation (it's got very small numbers).
dat$cooks.all.ols <- cooks.distance(all.vars.model.ols) 

Lets also summarize the Cooks

summary(dat$cooks.all.ols)
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## 0.0000000000 0.0000000884 0.0000005079 0.0000014147 0.0000016880 
##         Max. 
## 0.0018523941

And also, lets plot the CooksD.

p_load(lattice)
xyplot(dat$cooks.all.ols ~ predict(all.vars.model.ols,  type="response"),
       grid = TRUE,
       xlab = "Predicted Hits", 
       ylab = "Cooks D",
       main = "Model 1: Influencial Observations (Cooks D)",
       panel = function(x, ...) {
               panel.xyplot(x, ..., alpha = 0.2)
               }
       )

My recomnendation is that we drop all observations that fall above the 3rd quantile. That would mean to keep 464426 observations. In big data settings, that’s descent enough.

Lets subset the data then.

dat.cook = subset(dat, cooks.all.ols < as.numeric(summary(dat$cooks.all.ols)[5])) 

Now we have a smaller DF, but with less influence problems. Lets fit Model 2, all variables, but using the CooksD-corrected dataset (i.e. dat.cook).

all.vars.model.ols.cook = lm(formula.all, dat.cook)

OK. Now, lets inspect Cooks distance again.

options(scipen=100000000) # increases threshold for scientific notation (it's got very small numbers).
dcooks.distance.ols.fixed <- cooks.distance(all.vars.model.ols.cook) # getting Cook's distance.
# plot
p_load(lattice)
xyplot(dcooks.distance.ols.fixed ~ predict(all.vars.model.ols.cook,  type="response"),
       grid = TRUE,
       xlab = "Predicted Hits", 
       ylab = "Cooks D",
       main = "Model 2: Influencial Observations (Influential Obs. Dropped)",
       panel = function(x, ...) {
               panel.xyplot(x, ..., alpha = 0.2)
       }
)

Now the problem can be tolerated. But there are still some problems. More on that below. Lets now inspect the RMSE of Model 2.

sqrt(rev(anova(all.vars.model.ols.cook)$"Mean Sq")[1]) # RMSE
## [1] 133.3775

Not surpsingly, this RMSE is way better than Model 1. Lets see now how we are doing goodness-of-fit wise.

p_load(lattice)
xyplot(predict(all.vars.model.ols.cook,  type="response") ~ dat.cook$hit,
       grid = TRUE,
       xlab = "Observed Hits", 
       ylab = "Predicted Hits",
       main = "Model 2: Full Model Without Most Influencial Observations",
       panel = function(x, ...) {
               panel.xyplot(x,...)
               panel.lmline(x, ...)
               panel.xyplot(x, ..., alpha = 0.2)
               }
       )

OK. It does look better now. However, we now see that we also have leverage issues.

Now I will concentrate on the DFFITs.

p_load(olsrr)
ols_plot_dffits(all.vars.model.ols.cook)

Lots of problems here. Lets drop all observations that landed below/above the threshold. First, lets calculate the DFFFITs, so we merge them into our dataset.

dat.cook$dffits.ols = dffits(all.vars.model.ols.cook)

Second, lets plot the distribution of DFFITs, so we actually see what we are about to drop.

densityplot(dat.cook$dffits.ols,
            xlab = "Dffits", 
            main = "Density of Dffits Resulting of Fitting Model 2\n(Without Most Influencial Observations Dropped)",
            panel = function(...) {
                    panel.densityplot(...)
                    panel.abline(v = c(0.02, -0.02))
                    }
            )

The main take away is the following: if we chop the top ends of the distribution (i.e. the ones with high leverage), we still get the most of information out of this distribution. Lets drop these data points with high leverage (306 observations). Actually, is not a lot, but lets see if we improve fit, decreasing error, and getting a better RMSE.

First, lets drop data points with high/low leverage:

dat.cook.dffits = subset(dat.cook, dffits.ols > - 0.02 & dffits.ols < 0.02)

Next, lets fit Model 3, now without high/low DFFITS (and certainly, without high Cooks Distance values). Please remember that we are still working with the already CooksD-corrected dataset.

all.vars.model.ols.cook.dffits = lm(formula.all, dat.cook.dffits)

Now, lets get those DFFITs, so we can plot them.

options(scipen=100000000) # increases threshold for scientific notation (it's got very small numbers).
dffits.dcooks.distance.ols.fixed <- dffits(all.vars.model.ols.cook.dffits) # getting DFFITs.

# plot
p_load(lattice)
xyplot(dffits.dcooks.distance.ols.fixed ~ predict(all.vars.model.ols.cook.dffits,  type="response"),
       grid = TRUE,
       xlab = "Predicted Hits", 
       ylab = "DFFITS",
       main = "Full Model with Influential Obs. Dropped\n(Low CooksD, and no extreme DFFITs)",
       panel = function(x, ...) {
               panel.xyplot(x, ..., alpha = 0.3)
               }
       )

Now the problem can be tolerated. Lets get now those RMSEs.

sqrt(rev(anova(all.vars.model.ols.cook.dffits)$"Mean Sq")[1]) # RMSE
## [1] 133.399

Not much better. It’s almost the same as Model 2, at the price of loosing a couple hundreads observations. In fact, the RMSE is a little bit higher. Lets insepct fit.

p_load(lattice)
xyplot(predict(all.vars.model.ols.cook.dffits,  type="response") ~ dat.cook.dffits$hit,
       grid = TRUE,
       xlab = "Observed Hits", 
       ylab = "Predicted Hits",
       main = "Model 3: Full Model Without Most Influencial Observations",
       panel = function(x,
                        y, ...) {
               panel.xyplot(x, y, ...)
               panel.lmline(x, y, ...)
               }
       )

OK, so far we’ve fitted three models: (1) full, (2) good Cooks, and (3) good Cooks and good DFFITs. Lets use this truncated DF (i.e. dat.cook.dffits) to take the stepwise regression route again.

p_load(olsrr)
ols_step_best_subset(all.vars.model.ols.cook.dffits)
##                                             Best Subsets Regression                                            
## ---------------------------------------------------------------------------------------------------------------
## Model Index    Predictors
## ---------------------------------------------------------------------------------------------------------------
##      1         entry.page                                                                                       
##      2         entry.page session.duration                                                                      
##      3         entry.page path.id.set session.duration                                                          
##      4         entry.page path.id.set traffic.type session.duration                                             
##      5         agent.id entry.page path.id.set traffic.type session.duration                                    
##      6         locale agent.id entry.page path.id.set traffic.type session.duration                             
##      7         locale hour.of.day agent.id entry.page path.id.set traffic.type session.duration                 
##      8         locale hour.of.day agent.id entry.page path.id.set traffic.type session.duration day.of.week.num 
## ---------------------------------------------------------------------------------------------------------------
## 
##                                                               Subsets Regression Summary                                                               
## -------------------------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                                 
## Model    R-Square    R-Square    R-Square       C(p)           AIC             SBIC            SBC            MSEP          FPE         HSP       APC  
## -------------------------------------------------------------------------------------------------------------------------------------------------------
##   1        0.0990      0.0990       0.099    58491.2203    5914415.0189    4597289.0421    5914503.4021    20038.2107    20038.2107    0.0432    0.9010 
##   2        0.1757      0.1757      0.1757    14007.7060    5873126.6508    4556000.9693    5873226.0819    18332.5917    18332.5917    0.0395    0.8243 
##   3        0.1911      0.1910       0.191     5118.2793    5864416.3449    4547290.7535    5864526.8239    17991.7458    17991.7458    0.0388    0.8090 
##   4        0.1945      0.1945      0.1945     3109.0250    5862434.7191    4545299.1487    5862611.4855    17914.8982    17914.8982    0.0386    0.8055 
##   5        0.1972      0.1972      0.1972     1536.1104    5860895.6080    4543734.0647    5861227.0450    17855.0873    17855.0873    0.0385    0.8028 
##   6        0.1999      0.1998      0.1998       -2.4031    5859367.5913    4542198.0878    5859754.2678    17796.2465    17796.2465    0.0383    0.8001 
##   7        0.1999      0.1999      0.1998      -18.0936    5859351.8998    4542182.3969    5859749.6242    17795.6448    17795.6448    0.0383    0.8001 
##   8        0.1999      0.1999      0.1998      -18.0000    5859351.9933    4542182.4906    5859760.7655    17795.6484    17795.6484    0.0383    0.8001 
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria

If we want to maximize parsimony and minize RMSE, now model 6 does the best job (with 2 less predictors now). Lets define a new formula then, with the best predictors.

formula.best = as.formula(hits ~ 
                                  locale + 
                                  agent.id + 
                                  entry.page + 
                                  path.id.set + 
                                  traffic.type + 
                                  session.duration
                          )

Lets use this formula to fit Model 4, using the dat.cook.dffits dataset.

best.vars.model.ols = lm(formula.best, dat.cook.dffits)

Lets inspect goodness of fit.

p_load(lattice)
xyplot(predict(best.vars.model.ols,  type="response") ~ dat.cook.dffits$hit,
       grid = TRUE,
       xlab = "Observed Hits", 
       ylab = "Predicted Hits",
       main = "Model 4: Linear Model Without Most Influencial Observations\nReduced Form Model, after stepwise reg.",
       panel = function(x,
                y, ...) {
               panel.xyplot(x, y, ...)
               panel.lmline(x, y, ...)
               }
       )

Checking homoscedasticity.

ols_plot_cooksd_bar(best.vars.model.ols) # Cooks D

ols_plot_dffits(best.vars.model.ols) # DFFITs

Cooks D look good. DFFITs look just OK, but good enough. Most leverage observations land just below/above the threshold. We do have around 10 observations (little less perhaps) landing way below the negative threshold.

Lets get now those RMSEs.

sqrt(rev(anova(best.vars.model.ols)$"Mean Sq")[1]) # RMSE
## [1] 133.4016

Not much better. It’s almost the same as Model 3, after loosing a couple hundreads obs. In fact, the RMSE is a little bit higher. Lets try a different strategy.

Now, lets turn to the robust regression. These models weight the contribution of every data point by the size of the standard error: the more far away a data point is (i.e. the more error a data point brings into the model), the less “importance” that data point will have. This strategy works in cases like this, where we have leverage/influence issues.

# convert this, as it messes up with the LaTex dollar sign.
dat$agent.id = as.numeric(as.character(dat$agent.id))

p_load(MASS)
rlm.best.vars.entire.df = rlm(
        hits ~ locale + agent.id + entry.page + path.id.set + traffic.type + session.duration, 
        data = dat, 
        psi = psi.bisquare
        )

Lets evaluate the RMSE.

p_load(qpcR)
RMSE(rlm.best.vars.entire.df) # RMSE
## [1] 209.9251

Not a very good job. Lets consider Model 6, where we employ the same robust approach, but with the truncated dataset.

dat.cook.dffits$agent.id = as.numeric(as.character(dat.cook.dffits$agent.id))

p_load(MASS)
rlm.best.vars.truncated.df = rlm(
        hits ~ locale + agent.id + entry.page + path.id.set + traffic.type + session.duration, 
        data = dat.cook.dffits,
        psi = psi.bisquare
        )

Lets now calculate the RMSE.

p_load(qpcR)
RMSE(rlm.best.vars.truncated.df) # RMSE 
## [1] 133.7005

OK; this is better but not the the best (compared to Model 2).

Lets consider now GLMs. We will focus on two kinds, Poisson and Negative Binomial.

Does hits look like a (theoretical) Poisson distribution?

p_load(vcd)
distplot(dat$hits, type="poisson")

When the empirical distribution looks alike the theoretical one, we should see a straight line (as we do here). To answer my own question, yes, it definetively looks def like a Poisson distribution.

Lets inspect if hits look like a Negative Binomial distribution.

p_load(vcd)
distplot(dat$hits, type="nbinom")

In this case, it definetively does not look like a Negative Binomial distribution. Lets anyways fit both models.

First, we will do a stepwise regression fitting a Poisson model. Lets run the full model first.

full.model.poisson = glm(hits ~ locale + 
                             hour.of.day +
                             agent.id + 
                             entry.page +
                             path.id.set + 
                             traffic.type + 
                             session.duration + 
                             day.of.week.num,
                     data = dat, 
                     family = poisson(link = "log")
                     )

Second, lets proceed with the stepwise, and try to find the most efficient predictors.

step(full.model.poisson, direction = "both")
## Start:  AIC=100012561
## hits ~ locale + hour.of.day + agent.id + entry.page + path.id.set + 
##     traffic.type + session.duration + day.of.week.num
## 
##                    Df Deviance       AIC
## <none>                95624062 100012561
## - day.of.week.num   1 95624226 100012723
## - hour.of.day       1 95624795 100013292
## - agent.id          1 95658954 100047451
## - locale            5 95695021 100083509
## - traffic.type      6 95810566 100199052
## - path.id.set       1 96343996 100732493
## - entry.page        6 96576973 100965459
## - session.duration  1 97670923 102059420
## 
## Call:  glm(formula = hits ~ locale + hour.of.day + agent.id + entry.page + 
##     path.id.set + traffic.type + session.duration + day.of.week.num, 
##     family = poisson(link = "log"), data = dat)
## 
## Coefficients:
##      (Intercept)          localeL2          localeL3          localeL4  
##       5.33669114        0.03576591        0.03406074       -0.02277977  
##         localeL5          localeL6       hour.of.day          agent.id  
##       0.02879158        0.02255320       -0.00029455       -0.00365396  
##   entry.page2111    entry.page2113    entry.page2114    entry.page2115  
##       0.19329554        0.17038229        0.23047331        0.19451043  
##   entry.page2116       entry.page0       path.id.set     traffic.type2  
##       0.28444828        0.18955289        0.15919481        0.01477518  
##    traffic.type3     traffic.type4     traffic.type6     traffic.type7  
##      -0.06433595       -0.02247284       -0.06719006       -0.06611249  
##   traffic.type10  session.duration   day.of.week.num  
##      -0.15655863        0.00003122        0.00045224  
## 
## Degrees of Freedom: 619234 Total (i.e. Null);  619212 Residual
## Null Deviance:       101200000 
## Residual Deviance: 95620000  AIC: 100000000

This procedure suggest selecting all variables (as done in full.model.poisson). Lets inspect the RMSE.

p_load(qpcR)
RMSE(full.model.poisson)
## [1] 12.4267

It turns out that this RMSE is the lowest one so far. However, we should remember that RMSEs across linear and GLMs are not directly comparable; i.e. variances in GLM’s predictions aren’t constant. Lets check for overdispersion anyways.

p_load(AER)
dispersiontest(full.model.poisson)
## 
##  Overdispersion test
## 
## data:  full.model.poisson
## z = 584.04, p-value < 0.00000000000000022
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   142.6257

There is evidence of overdispersion (estimated to be 142.4094), violating the assumption of equidispersion. Lets now fit a Negative Binomial for completeness.

nbinom.full = glm.nb(formula.all, data = dat)

Lets now inspect the RMSE.

p_load(qpcR)
RMSE(nbinom.full)
## [1] 1.065472

It’s better than the Poisson regression. Overall, the evidence employing GLMs is not conclusive. The data generating process seems to be Poisson, but we do have overdispertion problems.

Table

Lets take a quick look at the coefficients, standard errors, and r-squares. The following table summarizes all models estimated in this case study.

p_load(texreg) # use htmlreg or texreg accordingly
htmlreg(list(all.vars.model.ols, # Model 1
            all.vars.model.ols.cook, # Model 2
            all.vars.model.ols.cook.dffits, # Model 3
            best.vars.model.ols, # Model 4
            rlm.best.vars.entire.df, # Model 5
            rlm.best.vars.truncated.df, # Model 6
            full.model.poisson, # Model 7
            nbinom.full # Model 8
            ),
       custom.model.names = c(
                "(M1: Ols)",
                "(M2: Ols)",
                "(M3: Ols)",
                "(M4: Ols)",
                "(M5: Rlm)",
                "(M6: Rlm)",
                "(M7: Poisson)",
                "(M8: Neg Bin)"),
       # label = "tab:1", # Option is only for tex tables, not HTML ones.
       custom.note = list("%stars. DV: Hits"),
       #fontsize = "tiny",
       # scalebox = 0.4, # Option is only for tex tables, not HTML ones.
       center = TRUE,
       digits = 3,
       # table = TRUE, # Option is only for tex tables, not HTML ones.
       # float.pos = "h",
       caption = "Different Models to Estimate the Number of Hits"
       )
<!DOCTYPE HTML PUBLIC “-//W3C//DTD HTML 4.01 Transitional//EN” “http://www.w3.org/TR/html4/loose.dtd”>
Different Models to Estimate the Number of Hits
(M1: Ols) (M2: Ols) (M3: Ols) (M4: Ols) (M5: Rlm) (M6: Rlm) (M7: Poisson) (M8: Neg Bin)
(Intercept) 180.362*** 136.270*** 129.341*** 130.005*** 159.802 130.045 5.337*** 5.322***
(3.881) (4.298) (4.453) (4.437) (1.642) (1.331) (0.000) (0.007)
localeL2 12.363*** 5.236*** 5.183*** 4.562*** 13.100 2.960 0.036*** 0.040***
(1.468) (1.202) (1.203) (1.194) (1.499) (1.240) (0.000) (0.006)
localeL3 9.962*** 0.815 0.744 0.354 12.191 -3.718 0.034*** 0.036***
(1.380) (1.136) (1.136) (1.133) (1.425) (1.182) (0.000) (0.006)
localeL4 -7.569*** -20.885*** -20.950*** -21.479*** -6.209 -24.257 -0.023*** -0.022***
(1.502) (1.215) (1.216) (1.210) (1.534) (1.253) (0.000) (0.006)
localeL5 8.218*** 0.044 -0.023 -0.706 9.932 -2.397 0.029*** 0.039***
(1.537) (1.252) (1.253) (1.242) (1.566) (1.289) (0.000) (0.007)
localeL6 7.146*** -3.613** -3.680** -4.359*** 7.789 -7.417 0.023*** 0.028***
(1.456) (1.187) (1.188) (1.177) (1.494) (1.229) (0.000) (0.006)
hour.of.day -0.111** -0.125*** -0.125*** -0.000*** -0.000
(0.040) (0.030) (0.030) (0.000) (0.000)
agent.id1 16.540*** 0.792 7.644 6.277
(3.632) (4.148) (4.308) (4.296)
agent.id2 2.955 -9.595* -2.743 -4.048
(3.746) (4.214) (4.372) (4.361)
agent.id3 -187.131*** -139.045*** -126.653** -128.295**
(23.115) (40.430) (47.362) (47.362)
agent.id4 -201.034*** -178.236*** -171.814*** -173.155***
(17.416) (19.959) (27.585) (27.583)
agent.id5 -206.549*** -184.879*** -177.153*** -178.673***
(20.185) (29.402) (38.754) (38.753)
agent.id7 24.068*** 19.539*** 26.303*** 25.009***
(3.918) (4.336) (4.489) (4.479)
agent.id8 -0.439 -18.051*** -11.202** -12.499**
(3.673) (4.169) (4.328) (4.316)
agent.id9 -1.252 -17.761*** -10.895* -12.155**
(3.615) (4.139) (4.299) (4.289)
agent.id10 3.717 -14.711*** -7.857 -9.113*
(3.601) (4.134) (4.294) (4.284)
agent.id11 -0.906 6.071 15.024 13.764
(6.840) (7.881) (8.323) (8.317)
agent.id12 -15.246** -12.896* -10.554 -11.867
(5.537) (6.315) (6.662) (6.654)
agent.id13 -5.438 -12.851** -5.984 -7.290
(3.765) (4.234) (4.391) (4.380)
agent.id14 13.959*** 7.259 13.968** 12.652**
(3.936) (4.341) (4.494) (4.483)
agent.id15 -158.438*** -145.357*** -138.284** -139.551**
(25.665) (33.616) (42.415) (42.415)
entry.page2111 53.635*** 69.876*** 69.850*** 69.860*** 57.441 69.922 0.193*** 0.177***
(1.512) (1.139) (1.139) (1.139) (1.564) (1.188) (0.000) (0.007)
entry.page2113 46.626*** 54.641*** 54.622*** 54.649*** 49.493 52.898 0.170*** 0.149***
(1.294) (0.969) (0.970) (0.970) (1.338) (1.011) (0.000) (0.006)
entry.page2114 66.411*** 90.015*** 89.994*** 90.014*** 71.705 90.265 0.230*** 0.210***
(1.574) (1.197) (1.198) (1.198) (1.629) (1.250) (0.000) (0.007)
entry.page2115 54.971*** 80.405*** 80.176*** 80.205*** 59.457 82.459 0.195*** 0.175***
(2.350) (2.070) (2.072) (2.072) (2.437) (2.164) (0.001) (0.010)
entry.page2116 85.929*** 115.242*** 115.249*** 115.274*** 94.479 115.552 0.284*** 0.268***
(1.325) (1.004) (1.005) (1.005) (1.369) (1.047) (0.000) (0.006)
entry.page0 56.680*** 71.169*** 71.264*** 71.325*** 63.225 65.508 0.190*** 0.170***
(1.080) (0.799) (0.799) (0.799) (1.104) (0.825) (0.000) (0.005)
path.id.set 45.440*** 48.900*** 48.958*** 48.952*** 52.144 47.976 0.159*** 0.176***
(0.674) (0.518) (0.518) (0.518) (0.696) (0.539) (0.000) (0.003)
traffic.type2 8.927*** 22.898*** 22.984*** 22.941*** 13.788 23.864 0.015*** 0.009
(1.197) (0.919) (0.919) (0.919) (1.240) (0.960) (0.000) (0.005)
traffic.type3 -16.931*** -2.012 -1.888 -1.889 -13.906 1.212 -0.064*** -0.082***
(1.561) (1.236) (1.237) (1.237) (1.620) (1.292) (0.000) (0.007)
traffic.type4 -3.510** 13.171*** 13.265*** 13.285*** 3.120 15.615 -0.022*** -0.045***
(1.319) (0.995) (0.996) (0.996) (1.367) (1.038) (0.000) (0.006)
traffic.type6 -18.849*** 8.218*** 8.352*** 8.328*** -5.244 17.343 -0.067*** -0.079***
(1.318) (0.979) (0.979) (0.979) (1.353) (1.011) (0.000) (0.006)
traffic.type7 -14.131** 11.338* 8.534 8.656 -9.152 13.231 -0.066*** -0.091***
(4.920) (5.064) (5.204) (5.204) (5.104) (5.434) (0.001) (0.021)
traffic.type10 -19.496*** 52.407*** 52.580*** 53.056*** 8.566 60.415 -0.157*** -0.168***
(2.732) (1.926) (1.927) (1.921) (2.829) (2.007) (0.001) (0.012)
session.duration 0.010*** 0.012*** 0.012*** 0.012*** 0.011 0.013 0.000*** 0.000***
(0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
day.of.week.num 0.181 0.135 0.134 0.000*** 0.000
(0.131) (0.097) (0.097) (0.000) (0.001)
agent.id -1.078 -1.125 -0.004*** -0.004***
(0.075) (0.057) (0.000) (0.000)
R2 0.061 0.200 0.200 0.200
Adj. R2 0.061 0.200 0.200 0.200
Num. obs. 619235 464426 464120 464120 619235 464120 619235 619235
RMSE 209.250 133.378 133.399 133.402
AIC 100012561.256 8325341.413
BIC 100012821.989 8325613.483
Log Likelihood -50006257.628 -4162646.707
Deviance 95624062.491 702974.258
p < 0.001, p < 0.01, p < 0.05. DV: Hits

Plots

Lets now take a quick look at the plots.

p_load(ggplot2,dvmisc,qpcR)

ggplot(dat, aes(
        x = hits, 
        y = predict.lm(all.vars.model.ols, type="response"))) +
        geom_smooth(method = "lm", colour = "red") +
        geom_jitter(alpha = .01, width = 80, height = 10, size=0.01, shape=23) + 
        theme_bw() + 
        ggtitle("M1: All Variables, Full Dataset (OLS)") + 
        scale_y_continuous(name='Hits Predicted') +
        scale_x_continuous(name='Hits Observed') +
        labs(subtitle = paste("RMSE: ", round(sqrt(get_mse(all.vars.model.ols)), 3))) + 
        theme(axis.text.y = element_text(size=7), 
              axis.text.x = element_text(size=7), 
              axis.title.y = element_text(size=7), 
              axis.title.x = element_text(size=7), 
              legend.text=element_text(size=12), 
              legend.title=element_text(size=12),
              plot.title = element_text(size=7),
              legend.position="bottom",
              plot.subtitle=element_text(size=7, face="italic")) 

ggplot(dat.cook, aes(
        x = hits, 
        y = predict.lm(all.vars.model.ols.cook, type="response"))) + 
        geom_smooth(method = "lm", colour = "red") +
        geom_jitter(alpha = .01, width = 80, height = 10, size=0.01, shape=23) +
        theme_bw() + 
        ggtitle("M2: All Variables, Fixed CooksD Dataset (OLS)") + 
        scale_y_continuous(name='Hits Predicted') +
        scale_x_continuous(name='Hits Observed') +
        labs(subtitle = paste("RMSE: ", round(sqrt(get_mse(all.vars.model.ols.cook)), 3))) + 
        theme(axis.text.y = element_text(size=7), 
              axis.text.x = element_text(size=7), 
              axis.title.y = element_text(size=7), 
              axis.title.x = element_text(size=7), 
              legend.text=element_text(size=12), 
              legend.title=element_text(size=12),
              plot.title = element_text(size=7),
              legend.position="bottom",
              plot.subtitle=element_text(size=7, face="italic"))

ggplot(dat.cook.dffits, aes(
        x = hits, 
        y = predict.lm(all.vars.model.ols.cook.dffits, type="response"))) + 
        geom_smooth(method = "lm", colour = "red") +
        geom_jitter(alpha = .01, width = 80, height = 10, size=0.01, shape=23) +
        theme_bw() + 
        ggtitle("M3: All Variables, Fixed CooksD and DFFITs Dataset (OLS)") + 
        scale_y_continuous(name='Hits Predicted') +
        scale_x_continuous(name='Hits Observed') +
        labs(subtitle = paste("RMSE: ", round(sqrt(get_mse(all.vars.model.ols.cook.dffits)), 3))) + 
        theme(axis.text.y = element_text(size=7), 
              axis.text.x = element_text(size=7), 
              axis.title.y = element_text(size=7), 
              axis.title.x = element_text(size=7), 
              legend.text=element_text(size=12), 
              legend.title=element_text(size=12),
              plot.title = element_text(size=7),
              legend.position="bottom",
              plot.subtitle=element_text(size=7, face="italic"))

ggplot(dat.cook.dffits, aes(
        x = hits, 
        y = predict.lm(best.vars.model.ols, type="response"))) + 
        geom_smooth(method = "lm", colour = "red") +
        geom_jitter(alpha = .01, width = 80, height = 10, size=0.01, shape=23) +
        theme_bw() + 
        ggtitle("M4: Best Variables, Fixed CooksD and DFFITs Dataset (OLS)") + 
        scale_y_continuous(name='Hits Predicted') +
        scale_x_continuous(name='Hits Observed') +
        labs(subtitle = paste("RMSE: ", round(sqrt(get_mse(best.vars.model.ols)), 3))) + 
        theme(axis.text.y = element_text(size=7), 
              axis.text.x = element_text(size=7), 
              axis.title.y = element_text(size=7), 
              axis.title.x = element_text(size=7), 
              legend.text=element_text(size=12), 
              legend.title=element_text(size=12),
              plot.title = element_text(size=7),
              legend.position="bottom",
              plot.subtitle=element_text(size=7, face="italic"))

ggplot(dat, aes(
        x = hits, 
        y = predict.lm(rlm.best.vars.entire.df, type="response"))) + 
        geom_smooth(method = "lm", colour = "red") +
        geom_jitter(alpha = .01, width = 80, height = 10, size=0.01, shape=23) +
        theme_bw() + 
        ggtitle("M5: All Variables, Full Dataset (Robust Regression)") + 
        scale_y_continuous(name='Hits Predicted') +
        scale_x_continuous(name='Hits Observed') +
        labs(subtitle = paste("RMSE: ", round(RMSE(rlm.best.vars.entire.df), 3))) +
        theme(axis.text.y = element_text(size=7), 
              axis.text.x = element_text(size=7), 
              axis.title.y = element_text(size=7), 
              axis.title.x = element_text(size=7), 
              legend.text=element_text(size=12), 
              legend.title=element_text(size=12),
              plot.title = element_text(size=7),
              legend.position="bottom",
              plot.subtitle=element_text(size=7, face="italic"))

ggplot(dat.cook.dffits, aes(
        x = hits, 
        y = predict.lm(rlm.best.vars.truncated.df, type="response"))) + 
        geom_smooth(method = "lm", colour = "red") +
        geom_jitter(alpha = .01, width = 80, height = 10, size=0.01, shape=23) +
        theme_bw() + 
        ggtitle("M6: Best Variables, Fixed CooksD and DFFITs Dataset") + 
        scale_y_continuous(name='Hits Predicted') +
        scale_x_continuous(name='Hits Observed') +
        labs(subtitle = paste("RMSE: ", round(RMSE(rlm.best.vars.truncated.df), 3))) +
        theme(axis.text.y = element_text(size=7), 
              axis.text.x = element_text(size=7), 
              axis.title.y = element_text(size=7), 
              axis.title.x = element_text(size=7), 
              legend.text=element_text(size=12), 
              legend.title=element_text(size=12),
              plot.title = element_text(size=7),
              legend.position="bottom",
              plot.subtitle=element_text(size=7, face="italic"))

ggplot(dat, aes(
        x = hits, 
        y = predict(full.model.poisson, type="response"))) + 
        geom_smooth(method = "lm", colour = "red") +
        geom_jitter(alpha = .01, width = 80, height = 10, size=0.01, shape=23) +
        theme_bw() + 
        ggtitle("M7: All Variables, Full Dataset (Poisson Regression)") + 
        scale_y_continuous(name='Hits Predicted') +
        scale_x_continuous(name='Hits Observed') +
        labs(subtitle = paste("RMSE: ", round(RMSE(full.model.poisson),3))) + 
        theme(axis.text.y = element_text(size=7), 
              axis.text.x = element_text(size=7), 
              axis.title.y = element_text(size=7), 
              axis.title.x = element_text(size=7), 
              legend.text=element_text(size=12), 
              legend.title=element_text(size=12),
              plot.title = element_text(size=7),
              legend.position="bottom",
              plot.subtitle=element_text(size=7, face="italic"))

ggplot(dat, aes(
        x = hits, 
        y = predict(nbinom.full, type="response"))) + 
        geom_smooth(method = "lm", colour = "red") +
        geom_jitter(alpha = .01, width = 80, height = 10, size=0.01, shape=23) + 
        theme_bw() + 
        ggtitle("M8: All Variables, Full Dataset (Negative Binomial Reg.)") + 
        scale_y_continuous(name='Hits Predicted') +
        scale_x_continuous(name='Hits Observed') +
        labs(subtitle = paste("RMSE: ", round(RMSE(nbinom.full),3))) + 
        theme(axis.text.y = element_text(size=7), 
              axis.text.x = element_text(size=7), 
              axis.title.y = element_text(size=7), 
              axis.title.x = element_text(size=7), 
              legend.text=element_text(size=12), 
              legend.title=element_text(size=12),
              plot.title = element_text(size=7),
              legend.position="bottom",
              plot.subtitle=element_text(size=7, face="italic"))


Conclusion

According to all these analyses, Model 2 (i.e. linear, with corrected Cook’s Distances) does the job the best. It has also the best fit.

Wrapping Up

I had lots of fun doing this! Ok, now I will generate the first deliverable.

  1. Two columns: (1) row_num and hits, for all which hits is NA.
dat.deliverable$hits[dat.deliverable$hits == "\\N"] <- NA # Declare NAs

csv.dat = data.frame( # Generate the DF
        row_num = dat.deliverable$row_num[is.na(dat.deliverable$hits)],
        hits = dat.deliverable$hits[is.na(dat.deliverable$hits)]
        )

write.csv(csv.dat,'deliverable1.csv') # Export DF to CSV
  1. The second deliverable will be the .rmd file.

  2. The third one, will be the PDF that the .rmd generates.

  3. Don’t forget that www.HectorBahamonde.com/Trivago contains the HTML version of this code/output (this is important so you can see wide outputs). Unfortunately, Zealpath does not allow me to send a fourth file.